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Abstract 

In this paper, we consider the Minimum Reaction Insertion (MRI) problem for finding the minimum number of additional 
reactions from a reference metabolic network to a host metabolic network so that a target compound becomes producible 
in the revised host metabolic network in a Boolean model. Although a similar problem for larger networks is solvable in a 
flux balance analysis (FBA)-based model, the solution of the FBA-based model tends to include more reactions than that of 
the Boolean model. However, solving MRI using the Boolean model is computationally more expensive than using the FBA- 
based model since the Boolean model needs more integer variables. Therefore, in this study, to solve MRI for larger 
networks in the Boolean model, we have developed an efficient Integer Programming formalization method in which the 
number of integer variables is reduced by the notion of feedback vertex set and minimal valid assignment. As a result of 
computer experiments conducted using the data of metabolic networks of £ coli and reference networks downloaded from 
the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, we have found that the developed method can 
appropriately solve MRI in the Boolean model and is applicable to large scale-networks for which an exhaustive search does 
not work. We have also compared the developed method with the existing connectivity-based methods and FBA-based 
methods, and show the difference between the solutions of our method and the existing methods. A theoretical analysis of 
MRI is also conducted, and the NP-completeness of MRI is proved in the Boolean model. Our developed software is available 
at "http://sunflower.kuicr.kyoto-u.ac.jp/~rogi/minRect/minRect.html." 
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Introduction 

Metabolism is one of the most important biological processes in 
organisms. Relations between reactions and chemicals in the 
metabolism are often represented by metabolic networks [1] . Since 
many of these metabolic processes can produce commodity and 
specialty chemicals, the manipulation of metabolisms has been 
extensively studied in the field of metabolic engineering. One of 
the most successful applications of metabolic engineering is 
production of industrially valuable products using a microbial 
host with recombinant technologies [2-4]. Techniques for 
production of desired chemicals using a microbial host are roughly 
classified into the following three types [5]: (a)combinations of 
existing pathways, (b)engineering of existing pathways, and (c) de 
novo pathway design. In (a), partial pathways can be recruited from 
independent organisms and co-localized in a single host. For 
example, 1,3-propanediol is synthesized by Nakamura et al. in 
which pathways from Saccharomyces cerevisiae and Klebsiella pneumonia 
were assembled in E. coli [6] and another example is the 



production of artemisinic acid, a precursor to the plant-based 
anti-malarial drug artemisinin in yeast [7] . In (b), new non-natural 
chemicals can be produced by engineering existing routes [8,9]. (c) 
is realized by the combination of (a) and (b), that is, the 
recruitment of partial pathways from different species and the 
use of engineered enzymes for extensions of pathways. It is to be 
noted that (a) focuses on the topology of the given metabolic 
networks, while (b) and (c) utilize the information of the structures 
of chemicals as well. 

The "pathway prediction system" (PPS) of the University of 
Minnesota Biocatalysis and Biodegradation Database (UM-BBD) 
is designed to predict routes for the biodegradation of xenobiotic 
compounds [10-12]. From a set of previously defined biotrans- 
formation rules, the PPS guides the user through potential 
pathways one step at a time, requiring the selection of a new 
target metabolite at each step [5] . Biochemical Network Integrated 
Computational Explorer (BNICE) is a computational framework 
for generating every possible biochemical reaction from a given set 
of enzyme reaction rules and source or target compounds [13,14]. 
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Figure 1. A problem of how to produce a target compound from the source nodes. In the Boolean model, either {R1, R4} or {R1, R2, R3) is 

sufficient, whereas {R1, R2, R3, R4, R5} is necessary for the flow model including FBA. 

doi:10.1371/journal.pone.0092637.g001 



However, since the number of predicted novel pathways is huge in 
many cases, some prioritization is necessary to choose the most 
promiscuous ones [15]. For example, one measure of such 
prioritization is to minimize the number of enzymatic steps [16]. 

In the type (a) problem, it seems that there are three major 
models for judging the producibility of target compounds, that is, 
connectivity model, flow model, and Boolean model. For each of them, 
Minimum Reaction Insertion (MRI) problem can be defined for 
finding the minimum number of additional reactions from a 
reference metabolic network to a host metabolic network so that a 
target compound becomes producible in the revised host 
metabolic network. In the connectivity model such as [16], the 
producibility of target compounds is judged by the connectivity 
between the source and the target compounds. After the source 
and the target compounds are connected by the additional 
reactions, the producibility is often evaluated by such a flow model 
as flux balance analysis (FBA) or an elementary mode [17], in 
which the sum of incoming flows must be equal to the sum of 
outgoing flows for each compound and the ratio of the amount of 
substrates and products must satisfy the coefficients given in each 
chemical reaction formula. In the Boolean model, each reaction 
occurs if all its substrates are producible whereas each compound 
is producible if one of its producing reactions occurs [18]. The 
source compounds are called seeds and the producible compounds 
are called the scope of the seed. In this model, a Boolean function of 



"AND" is attached to each reaction node and "OR" is attached to 
each compound node in the metabolic networks. 

For example, suppose that there is a chemical reaction 
"A+B^C+D", where A and B are called substrates whereas C 
and D are called products. In the connectivity model, either A or B 
is necessary to produce C and D, whereas both A and B are 
necessary for the Boolean model. In the flow model including 
FBA, in addition to the condition that both A and B must exist, 
both C and D are necessary to be consumed by other reactions. 
Thus, each model outputs a different solution for producing 
desired compounds. 

From the view point of computational complexity, although the 
connectivity model is very simple and then applicable even to very 
large networks, its logical analysis ability is not strong since it 
cannot detect the lack of necessary substrates. The good point of 
the flow model is its computational efficiency since problems in the 
flow model can often be formalized by linear programming, for 
which there exist polynomial time algorithms [19]. However, these 
polynomial time algorithms are not applicable for MRI since 
discrete variables are necessary for representing additional 
reactions, although it is solvable by mixed integer programming 
[20]. 

Although the computational time of the FBA-based method for 
MRI is very small and scalable for genome-scale metabolic 
reconstruction [20], Boolean methods also have attractive features 
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Figure 2. An example of MRI. v,., and v ( , are the source nodes. 
doi:1 0.1 371 /journal.pone.0092637.g002 
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and are expected to complement the FBA-based method. Indeed, 
for the analysis of metabolic networks, many studies have been 
conducted to develop Boolean models. For example, Lemke et al. 
[21] studied the effect of deletion of each enzyme in the metabolic 
network of a Boolean model, and Smart et al. [22] considered 
almost the same problem from the viewpoint of the Boolean aspect 
of the flux balance model. Li et al. [23] and Sridhar et al. [24] have 
developed methods for finding a set of enzymes whose inhibition 
stops the production of the target compounds with a minimum 
elimination of the non-target compounds. Lee et al. [25] and 
Takemoto et al. [26] estimated the distribution of the size of the 
effect of the deletions of enzymes using a branching process. 

As for the shortcoming of the FBA-based method for MRI, it 
tends to be considerably affected by the redundancy of the given 
metabolic network since each node is affected not only by the 
incoming flows but also by the outgoing flows. For example, 
suppose that a metabolic network of Fig. 1 (A) is given, where 
circles and rectangles represent compounds and reactions respec- 
tively. In order to produce the target compound from the source 
compounds, {Rl, R2, R3, R4} is necessary in the flow model 
including FBA, whereas either {Rl, R4} or {Rl, R2, R3} is 
sufficient for the Boolean model. Moreover, in the metabolic 



network of Fig. 1 (B), {R1,R2,R3} is necessary for FBA whereas 
{R2} is sufficient for the Boolean model. 

Therefore, in this research, we study the problem of designing a 
pathway for producing target compounds in metabolic networks of 
the Boolean model since its logical analysis ability is more stable 
than that of the FBA, particularly when the flexible parts of the 
metabolic networks are large. Our approach is based on (a), that is, 
the combination of existing pathways. In our problem setting, a 
base metabolic network of a host organism, which we call the host 
network, is given; it cannot produce the target compound in its 
initial form. However, an integrated metabolic network of many 
other organisms are given as the reference network from which we 
should find the minimum number of additional reactions so that 
the target compound becomes producible. We prove that this 
problem is NP-complete. 

Although both the FBA-based model and the Boolean model for 
MRI are considered to be NP-complete, the former is likely to 
have a faster exponential time algorithm than the latter since FBA 
has fewer integer variables. Although the computational complex- 
ity of the Boolean model is large, we develop an efficient method 
based on integer programming (IP) [27,28], which is often used as 
a formalization of NP-complete problems and there is an efficient 
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Figure 3. The cycles are decomposed in the FVS-based method. 
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Figure 4. A special case where there is no cycle. (A) An example where a contradiction occurs if the notion of time is not used. (B) 

Decomposition of a border node to avoid the contradiction. 

doi:10.1371/journal.pone.0092637.g004 



free solver for IP called CPLEX [29]. We also conducted four 
computer experiments in which the metabolic network of E. coli is 
used as the host network and the reference pathway of the KEGG 
database [30] is used as the reference network, and propanol, 
butanol, sedoheptulose 7-phosphate, and maleic acid are used as 
the target compound in each experiment. The results of the 
experiments show that (1) our IP-based method can appropriately 
solve MRI in the Boolean model; (2) solutions of MRI in the 
Boolean model are more suitable than those by connectivity based 
methods; (3) our IP-based method is applicable to large-scale 
networks where an exhaustive search does not work; and (4) 
solutions of MRI in the Boolean model tend to be smaller than 
those in the FBA-based model based on [31]. Our developed 
software is available at "http://sunflower.kuicr.kyoto-u.ac.jp/ 
~rogi/ minRect/ minRect.html". 

Materials and Methods 

Problem Definition 

In this section, the main problem Minimum Reaction 
Insertion (MRI) in a Boolean model is first explained with an 
example and then mathematical formalization is described. 

Suppose that a metabolic network shown in Fig. 2 is given, 
where each rectangle (resp., circle) corresponds to a reaction (resp., 
chemical compound). For example, v r 4 is a reaction, its substrates 



are v c 3 and v c 5 and its products are v c 6 and v c l- Black circles V c l 
and v C 2 denote the source nodes and are assumed to be provided 
by the external environment. On the other hand, a gray circle v c j 
represents a target compound and the purpose of MRI is to make 
the target compound producible. However, initially only the host 
network, which is shown by the dotted rectangle, is available. 
Since only V C1 ,V C2 ,V C 3 and V r i are included in the host network, the 
target compound v c -j is not producible. Instead the entire network 
is called the reference network and reactions not included in the 
host network can be added later. In MRI, the minimum number 
of additional reactions should be determined to make the target 
compound producible. In this example, the addition of 
{ v r2, v r3, v r4} is the optimal solution. The difficult point of MRI 
is how to deal with the effect of cycles. In the example of Fig. 2, the 
addition o{{v r 4,V r $} looks like the optimal solution. However, this 
solution is not appropriate since it relies on the cycle consisting of 
{v c 6,v r 5,v c 5,v r 4} and v c 7 is not producible unless the initial amount 
of v c 6 is sufficiently large. 

MRI is mathematically defined as follows: A metabolic network can 
be represented by a directed graph G = ( V,E). There are two 
types of node sets V c and V T , where V c denotes a set of compound 
nodes and V r represents a set of reaction nodes. V = V c \JV r and 
V C C\V, = {} hold. The neighbors of compound nodes must 
be reaction nodes, and the neighbors of reaction nodes must be 
compound nodes. Let Fj<= V c be a set of source nodes and v t eV c be 
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Figure 5. Propanol (C00479) becomes producible from glycolysis and gluconeogenesis by the addition of 1/, = {R01514, R01752, 
R01036, R01048, R02577, R02376}. 

doi:10.1371/journal.pone.0092637.g005 



a ftztgrf reorfe. Source nodes have no incoming edges and correspond 
to the seed compounds of [18]. In this study, we assume that 
source nodes are producible at any time. 

Suppose that a host network G\=(V\ ,E\ ) and a reference network 
G2 = (Vi,E-i) are given where G\ and G2 are metabolic networks, 
and G\ is a subgraph of G2 induced by V\ . V c (resp., V' r ) is a set of 
compound nodes (resp., reaction nodes) in V2 — V\ and is called 
the set of additional compound nodes (resp., additional reaction nodes). 

Let V a <= V' r be a set of additional reaction nodes. In the 
Boolean model, each node is assigned either "0" or "1". For a 
compound node, "1" means producible and "0" means not 
producible. As for a reaction node, "1" means active and "0" 
means inactive. Let A be such an assignment (that is A is a 
function from V to {0,1}). For each node veV, we write v = 0 
(resp., v = l) if 0 (resp., 1) is assigned to v. A is called a valid 
assignment if the following conditions are satisfied: (i) for each veV s , 
v= 1. (ii) for each veV c — V s , V= 1 if and only if there is u such that 
(u,v)eE and u= 1. (iii) for each veV r , v= 1 if and only if veV a {J V\ 
and u= 1 holds for all u such that (u,v)eE. This implies that each 
reaction node corresponds to an "AND" node and each 
compound node corresponds to an "OR" node. 

If G2 has no directed cycles, a valid assignment is uniquely 
determined for each V a . However, if G2 has a directed cycle, 



multiple valid assignments may exist. Let us call v,eK, and 
VjeV c — V s source connected if there is a directed path from v,- to Vj, 
and the values of the nodes included in the path are all 1 . There 
exist valid assignments where the values of nodes in a directed 
cycle are 1 even if these nodes are not source connected. In order 
to avoid such a case, we use the notion of minimal valid assignment, 
which is similar to the notion of maximal valid assignment defined 
in [32]. A valid assignment A is called minimal HA is valid and 
{v|v=l,veK} is minimal with respect to the inclusion relation- 
ships for sets. 

Now we define the Minimum Reaction Insertion as 

follows: 

• Input: A host metabolic network G\={V\,E\), a reference 
metabolic network G2 = ( ViJii), and a target compound v t . 

• Output: A minimum cardinality set of V a for which v t = 1 is 
satisfied in the minimal valid assignment of the induced 
subgraph of G 2 by KiU V C U V a . 

As mentioned in the section of Theoretical Results, a minimal 
valid assignment is uniquely determined if V a is given. However, 
solving MRI is not easy since the number of candidate V a is 2' K ' 
and MRI is proved to be NP-complete. Since utilizing software 
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Figure 6. When the target compound was C05382, MR I selected R01827 and R01830 from 66 candidates for the additional 
reactions whereas the shortest path-based method (PathComp) selected only R01827. 
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Figure 7. The comparison between the Boolean model and the FBA-based model [20]. More reactions are necessary for the FBA-based 
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Figure 8. The polynomial time reduction from minimum vertex cover (MVC) problem to minimum reaction insertion (MRI) problem. 

(A) An instance of MVC. (B) The corresponding instance of MRI. 
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packages of Integer Programming (IP) is efficient for solving NP- 
complete problems, we develop a method of IP formalization for 
solving MRI. Since the computational time of the IP-based 
method is considered to be exponential in terms of the number of 
variables, it is important to develop an IP formalization of MRI 
with a small number of variables. To do so, our previously 
developed method for Minimum Reaction Cut (MRC) [32] 
may be useful although many modifications are necessary. 

MRC is a problem to find a minimum set of reactions that 
interfere with the production of target compounds [32] and is 
known to be NP-complete. Let m (resp., n) be the number of 
compound (resp., reaction) nodes. If we use m+n time steps to 
calculate the maximal valid assignment in MRC, the number of 
variables in IP is 0((m+ri) ). The feedback vertex set (FVS) is a 
node set whose removal makes a network cycle-free. In [32], we 



succeeded in reducing the number of variables to 0(f(f + «? + «)), 
where / is the size of the feedback vertex set and / is considerably 
smaller than m or n. If use of 0((m + ti) 2 ) variables is allowed in 
MRI, almost the same method as in MRC can be used. However, 
to reduce the number of variables in IP to Oifif +m+n)), many 
modifications are necessary since minimal valid assignment and 
maximal valid assignment have different features. 

Integer Programming-Based Method for Minimum 
Reaction Insertion 

Here, we show IP formalization methods for MRI in the 
Boolean model. To apply IP, problems must be formalized to 
maximize or minimize a given objective function which is a linear 
function of integer variables and constraints must also be given as 
linear equations or inequations of integer variables. 
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Suppose that the host network and the reference network are 
given as shown in Fig. 2. The simplest IP formalization IP-MRI-A 
for solving Minimum Reaction Insertion is as follows where 
the time step increases by 1 when the Boolean calculation is 
synchronously conducted for every node: 

IP-MRI-A 

Minimize 

TER2(0) + TER3 (0) + TER4(0) + TER5 (0) ( 1 ) 
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TX + FX=lforanyX (17) 

where every variable takes either 0 or 1. v„ = 1 (resp., v n = 0) at 
time step t is represented by TRi(?) = 1 (resp. FR/(?) = 1) and 
TR/(0+FR/(0 = 1 holds for any i and t. For example, TR2(1) = 0 
means that v r2 = 0 at time step 1 , and FR2( 1 ) = 1 automatically 
holds at the same time. In the implementation, FR/(?) is replaced 
with l-TRi(?) to reduce the number of variables. Similarly, the 
values of compound nodes are represented by TCi(t) and FCi(f). 
For example, FC4(3)= 1 means that v C 4 = 0 at time step 3. 

(3) represents the Boolean relation v,i((+l) = v c 2(()Av c j((). 
Since Boolean relations such as "A" or "V" cannot directly be 
used in IP, it is necessary to convert them into linear equations 
and/or inequations. Since X\ = x 2 A x 3 A • • • A x^ can be 
represented by (x\ V xj V xj V ■ ■ • V xjt) A (x[ V x 2 ) A (xf V x 3 ) 
A • • • A (xT V Xk)= 1, v r i(t+l) = v c2 (t) A v c5 (f) can be converted 
into (v r i0+ 1) V v^(7j V vvslYj) A (v rl (t+l) V v c2 (t)) A (v rl (?+l) 
Vv c 5(?)) = 1, and then (3) is obtained. 

For a compound node with indegree 1, the value of the 
predecessor node is just copied. For example, since v c3 has only 
one predecessor v>l, v&{t+ 1) is just copied from v r \(t) as shown in 
(8). Similarly, v c 4(f+l) is just copied from v r 2(i) as shown in (9). 

For a compound node with indegree more than 1 , it is necessary 
to convert the "V" relation into linear equations or equations. (10) 
represents the Boolean relation v c s(f+ 1) = v r i{t) V v>5(?)-Since 
x\ =x 2 V x 3 V • • • V Xk is represented by (xf V x 2 V x 3 V • • • V x^) 
A(xi Vxj)A(xi Vx0A---A(xi V^)=l, v c5 (f+ l) = v r3 (r)V 
v r s(t) can be converted into (v c s(f + 1) V v r3 (?) V v r s{t))f\ 
(y c5 (t+ 1) V v r3 (t)) A (v c5 (/+ 1) V v r5 (0)= 1, and then (10) is 
obtained. 

As for the reaction nodes not included in the host network, 
TER/(?) and FER;'(/) are used to represent whether v r ; is 
activated. We use a virtual node v ei as one of the predecessors 
of v r ,\ Since v„ is represented by an AND node, v« = 0 keeps v r; - 
inactive even if all other predecessors of v r ; are 1 . For example, v r2 
in Fig. 2 has only one predecessor v c \. However, since v r2 is not 
included in the host network and v e , = 1 is necessary for v r2 = 1 , 
v r2 (t+\) = v c \(i) f\ v e2 (i) must hold, and then (4) is obtained. 

Since we assume minimal valid assignment, at t = 0, the source 
compound nodes are assigned 1, but the other compound nodes 
and reaction nodes are assigned 0. 

m + n is the largest number of time steps necessary for the 0-1 
assignment to converge. (1) means that the number of additional 
reactions should be minimized. (2) means that the target 
compound v c -j should become 1 after the 0-1 assignment 
converges. (3)-(7) represent the constraints by v r \ to v r 5 
respectively. Note that v e \ does not exist since v r \ is included in 
the host network and then v,i = 1 holds for any V a . (8)-(12) 
represent the constraints by v f3 to v c -j respectively. (13) represents 
that V a does not change by time transition. (14) means that v c \ and 
v c2 are source nodes. (15)— (1 6) represent that all nodes but source 
nodes are assigned 0 in the initial state. (17) means that "T" and 
"F" represent "true (1)" and "false (0)" respectively, and 
complement each other. 

The above formalization can clearly solve MRI and obtain the 
correct solution V a = {v r2 ,v r j,v r n}, however 0((m + n) 2 ) variables 
are necessary. To reduce the number of variables, it is necessary to 
reduce the number of time steps. If time is not taken into account 
at all, the following inappropriate IP formalization IP-MRI-B is 
obtained. 

IP-MRI-B 



Minimize 

TER2 + TER3 + TER4 + TER5 (18) 

Subject to 

TC7 = 1 (19) 
TR1 + FC2 + FC5>1, 

FR1+TC2>1, (20) 
FR1+TC5>1 

TR2 + FC1+FER2>1, 

(21) 

FR2 + TCI > 1 , FR2 + TER2 > 1 
TR3 + FC4 + FER3>1, 

(22) 

FR3 + TC4 > 1, FR3 + TER3 > 1 

TR4 + FC3 + FC5 + FER4 > 1 , 

FR4 + TC3>1, (23) 

FR4 + TC5 > 1 , FR4 + TER4 > 1 

TR5 + FC6 + FER5>1, 

(24) 

FR5 + TC6 > 1, FR5 + TER5 > 1 



TC3 = TR1 (25) 

TC4 = TR2 (26) 
FC5 + TR3 + TR5>1, 

(27) 

TC5 + FR3 > 1 ,TC5 + FR5 > 1 

TC6 = TR4 (28) 

TC7 = TR4 (29) 

TC1 = 1,TC2=1 (30) 

TX + FX=lforanyX (31) 



When compared to IP-MRI-A, (18),(19),(20)-(24), (25)-(29), 
(30), and (31) correspond to (l),(2),(3)-(7), (8)-(12), (14), and (17) 
respectively although the notion of time is not used in IP-MRI-B. 
(13) in IP-MRI-A means that the value of v e , does not change in 
the time transition, but this constraint is not necessary for IP-MRI- 
B since it does not have the notion of time. Moreover, neither (15) 
nor (16) of IP-MRI-A is used in IP-MRI-B. 

In IP-MRI-B, the solution of IP is V a = {v r4 ,v r5 } since 
(v r i,...,v r5 ,v c i,...,v c7 ) = (l,0,0,l,l, 1,1,1,0,1,1,1) is a valid 
assignment and satisfies v c q = 1 . Note that v r2 and v r3 are forced 
to be 0 since they are not included in either the host network or 
V a . Although it satisfies all constraints and | V a \ is minimum, this 
assignment is not appropriate since {v r i,,v c k,v r s,v c s} forms a cycle 
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and all of them are assigned 1 without the influence of source 
nodes. To avoid such an inappropriate assignment, it is necessary 
to consider minimal valid assignment with respect to the number 
of 1 s for each V a . As shown in the section of Theoretical Results, 
the minimal valid assignment is uniquely determined for each V a . 

Thus, IP-MRI-A can solve MRI, but m + n time steps are 
necessary, while IP-MRI-B, which does not use the notion of time, 
cannot solve MRI. The feedback vertex set (FVS) is a set of nodes 
whose removal makes the network acyclic. Since IP-MRI-B can 
solve MRI if there is no cycle, it is reasonable to apply IP-MRI-B 
for the acyclic network obtained by the deletion of FVS and use 
the notion of time as in IP-MRI-A to nodes included in F based on 
the idea developed in [32]. 

In the improved method, IP-MRI-C, we firstly find an FVS F 
consisting of reaction nodes and then decompose each v r i£F into 
two nodes v ri and v„ so that v ri has only in-edges and v„ has only 
out-edges. For example, in the network of Fig. 2, since F = {v r 4} is 
a feedback vertex set, v r n is decomposed into v r 4 and v,4 as shown 
in Fig. 3. Furthermore, we put an additional constraint that 
v si (t+ 1) = v„(0- The number of time steps of IP-MRI-C is / + 1 
while that of IP-MRI-A is m + 1, where / = |.F|. Therefore, the 
numbers of variables in IP-MRI-C and IP-MRI-A are 
0(f(m + n+f)) and 0((m + n) 2 ) respectively. Since / is consid- 
erably smaller than m + n in most metabolic networks and the 
computational time of IP exponentially increases with the number 
of variables, we can expect a significant improvement from the 
view point of the computational time. 

Although finding the minimum FVS is known to be NP- 
complete, it is not necessary to use the minimum FVS in our 
problem setting. We use a simple greedy algorithm to choose FVS 
as follows: 

Procedure GreedyFVS(G = (V,E)), where V = {v\,. . . ,v„} and 
E = {e u . . . ,<?,„} 

for i = 1 to n do 

v[*l=0; 
for i = 1 to m do 



= 0: 



while there exists e^eE such that e[k] = 0 
v[/] = l; 

if there exists ejt = (vj,v/) such that e[&]=0 and 
v\j] = 1 then 

E'^E-{e k }; 

V'^VU{v'j}; 

for all (Vj,v q )eE' do 

£M#-{(v,,v,))U{(v;,v,)}}; 

Greedy FVS(G' = (V',E')); 
else if there exists e k = (v,-,V/) e E such that 
e[k}=0 and v[/]=0 do 

«[*]«- 1; 
*'<-/; 

else if there exists vjeV such that v[j] =0 do 
for the minimum j; 

return G = (V,E); 



Since the reaction nodes for FVS are chosen by a greedy 
algorithm, the size of FVS is not always optimal. However, it is 
important to note that even if the size of FVS is not optimal, the 
solution of MRI calculated by IP-MRI-C is always optimal. If 
there are multiple optimal solutions in MRI, there is a possibility 
that the solutions are different since IP outputs only one solution. 
However, it may be possible to enumerate all optimal solutions of 
MRI by iteratively solving IP with a constraint to avoid the already 
chosen solutions. 

For example, IP-MRI-C for Fig. 2 is as follows, where v r 4 is 
decomposed into v r 4 and v s 4, and time step increases by 1 only 
when the value of v r 4 is copied to v.,4. 

IP-MRI-C 

Minimize 

TER2(0) +TER3(0) + TER4(0) +TER5(0) (32) 



Subject to 

TC7(1) = 1 

for all 7 = 0,1 

TR1 (t) + FC2(t) + FC5(t) > 1 , 

FRl(t)+TC2(t)>l, 
FRl(t)+TC5(t)>l 

TR2(t) + FC1 (t) + FER2(t) > 1 , 

FR2(t) + TCI (t) > 1 , FR2(t) + TER2(t) > 1 

TR3(t) + FC4(t) + FER3(t) > 1, 

FR3(t) +TC4(t) > l,FR3(t) +TER3(t) > 1 



TR4(t) + FC3(t) + FC5(t) + FER4(t) > 1 , 
FR4(t)+TC3(t)>l, 

FR4(t) + TC5(t) > 1 , FR4(t) + TER4(t) > 1 

TR5(t) + FC6(t) + FER5 (t) > 1 , 

FR5(t) + TC6(t) > 1, FR5(t) + TER5(t) > 1 

TC3(t)=TRl(t) 

TC4(t)=TR2(t) 

FC5(t) + TR3(t) +TR5(t) > 1, 

TC5(t) + FR3(t) > 1, TC5(t) + FR5(t) > 1 

TC6(t)=TSR4(t) 



(33) 



(34) 



(35) 



(36) 



(37) 



(38) 



(39) 



(40) 



(41) 



(42) 
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TC7(t)=TSR4(t) (43) 

TER2(t+l)=TER2(t), 
TER3(t+l)=TER3(t), 

(44) 

TER4(t+l)=TER4(t), 
TER5(t+l)=TER5(t) 

TSR4(t+l)=TR4(t) (45) 

TCl(t) = l,TC2(t) = l (46) 

TC7(0)=0 (47) 

TSR4(0)=0 (48) 

TX + FX=lforanyX (49) 



where each variable takes either 0 or 1 . 

When compared to IP-MRI-A, (32),(33),(34)-(38), (39)-(43), 
(44), (46), and (49) correspond to (l),(2),(3)-(7), (8)-(12), (13), (14), 
and (17), respectively. {v^} is chosen as a feedback vertex set, and 
then decomposed into v r 4 and v s 4 as shown in Fig. 3. 

Note that the number of time steps is 2 =/+l, and TSR4(?) = 1 
represents v,4 = 1 . In (42)-(43), the constraints for v c e and v c i are 
represented by the variable corresponding to v.,4 instead of that to 
v r 4. (45) represents that the time step increases by 1 when the value 
of v r 4 is copied to v,4 in Fig. 3. (48) represents v,4(0) = 0 to obtain 
the minimal valid assignment. 

Additionally, if we use the FVS-based method and no cycles are 
included in G\ and G2, the number of necessary time steps is only 
one. For example, suppose that G\ and G2 are as shown in Fig. 4 
(A). In this case, the correct solution of MRI is 
V a = {v r 2,v r 3,v r 4,v r 5}. However, if we set TC1(0)=1 and 
TC6(0) = 0, IP can output no solution since the condition 
TC6(0) = 1 is never satisfied. On the other hand, if we set 
TC1(0)= 1 and TC6(0) = 1, an inappropriate solution V a = {} is 
obtained by IP. To avoid such a case, in our method, if one of the 
predecessors of an additional reaction node v r is included in the 
host network, we decompose v r as if it were included in FVS. For 
example, in the network of Fig. 4 (A), v r i is decomposed into v r 2 
and v s 2 as shown in Fig. 4 (B) so that the values of the source nodes 
and the target node are calculated in different time steps. 

Results 

Computer Experiments 

We conducted computer experiments for solving MRI with data 
downloaded from the KEGG database. The experiment was 
conducted on a PC with an Intel(R) Xeon(R) 3.33 GHz CPU and 
10 GB RAM having the SUSE Linux (version 12.2) operating 
system, where CPLEX (version 12.4.0.0) was used as the solver of 
integer programming. 



In this study, a reference network consists of the central 
metabolism and the related modules necessary for producing the 
target compound. A map of the KEGG PATHWAY is a 
minimum unit, and three or four maps of the KEGG PATHWAY 
are chosen and integrated as the reference network in each of our 
experiments. As for species, a reference network includes the 
chemical reactions of all species, whereas the metabolic networks 
of E. coli are used for the host networks. The major difference 
between the pathway alignment methods by KEGG and our 
developed method is that our method is based on a Boolean 
model, whereas the pathway alignment methods consider only the 
topology of networks. 

In synthetic biology, it is of great interest to construct a minimal 
genome that realizes the desired functions [33-35]. Since 
glycolysis, gluconeogenesis, citrate cycle and pentose phosphate 
pathway are considered to be essential even in artificial organisms, 
it is reasonable to assume that the host networks in the computer 
experiments have some of these pathways in one of the simplest 
organisms, E. coli. Because the purpose of this study is not focused 
on the reconstruction of genome-scale metabolic network model, 
but the design of a minimal genome in addition to the existing 
pathways to produce a desired compound, each reference network 
consists of the maps of the KEGG pathway located between the 
central metabolism and each target compound. 

In the first computer experiment, the target compound is 
propanol (C00479 in KEGG ID), the host network is glycolysis 
and gluconeogenesis of E. coli (ecoOOOlO.xml), and the reference 
network covers glycolysis, gluconeogenesis and glycerolipid metab- 
olism of other species (koOOOlO.xml and ko00561.xml). The 
numbers of compound and reaction nodes are 58 and 85, 
respectively, where 30 reactions are reversible. The source nodes 
are D-glucose (C00031), oxaloacetate (C00036), salicin (C01451), 
arbutin (C06186), UDP-glucose (C00029), acyl-CoA (C00040), and 
diglucosyl-diacylglycerol (C06040), which are represented by black 
circles in Fig. 5. It took 0. 19 s to solve MRI. The obtained additional 
reactions are F a = {R01514, R01752, R01036, R01048, R02577, 
R02376}, where these reactions produce propanol from 3-phospho- 
D-glycerate (C00197) via glycerol (COO 116) as shown in Fig. 5. 
Since 3-phospho-D-glycerate (C00197) is producible by glycolysis 
and gluconeogenesis of E. coli and works as a connection between 
glycolysis and glycerolipid metabolism, the obtained V a can be 
considered an appropriate solution of MRI. 

Difference between Developed Model and Shortest Path- 
Based Model 

To show the difference between the developed model and the 
shortest path-based models, we conducted the second experiment 
where PathComp of KEGG ("http://www.genome.jp/tools/ 
pathcomp/") was used to calculate the solution of the shortest 
path-based model. In the experiment, the host network consists of 
glycolysis, gluconeogenesis and citrate cycle of E. coli 
(ecoOOOlO.xml and eco00020.xml), and the reference network 
consists of glycolysis, gluconeogenesis, citrate cycle and pentose 
phosphate pathway of other species (koOOOlO.xml, ko00020.xml 
and ko00030.xml). The numbers of compound and reaction nodes 
are 64 and 108, respectively, where 59 reactions are reversible. 
There are four source nodes, D-glucose(C00031), arbu- 
tin(C06186), salicin(C01451), and acetate (C00033), and the 
number of candidates for the additional reactions is 66. When 
the target compound is sedoheptulose 7-phosphate (C05382), as 
shown in Fig. 6, the solution of MRI is K a = {R01827, R01830}, 
where the substrates of RO 1827 are beta-D-fructose 6-phosphate 
(C05345) and D-erythrose 4-phosphate (C00279). It took 32.58 s 
to obtain the solution. Since D-erythrose 4-phosphate (C00279) is 
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not included in the host network, it is necessary to add R01830 in 
which substrates are beta-D-fructose 6-phosphate (C05345) and 
D-glyceraldehyde 3-phosphate (COO 118) and the products are D- 
xylulose 5-phosphate (C00231) and D-erythrose 4-phosphate 
(C00279). It is to be noted that both beta-D-fructose 6-phosphate 
(C05345) and D-glyceraldehyde 3-phosphate (COO 118) are pro- 
ducible by the host network. 

On the other hand, PathComp just connects the producible 
compounds and the target compound adds only R01827 since 
R01827 is adjacent to both beta-D-fructose 6-phosphate (C05345) 
and sedoheptulose 7-phosphate (C05382). However, it is clear that 
R01827 does not occur if D-erythrose 4-phosphate (C00279) does 
not exist. Thus the difference between the shortest path-based 
method and the developed method is that the developed method 
considers Boolean constraints for each reaction and compound 
whereas the shortest path-based method only considers the 
connectivity of nodes. 

Scalability 

Next, we conducted the third experiment to show the scalability 
of our method. The host network consists of the source nodes of 
glycolysis and gluconeogenesis of E. coli (ecoOOOlO.xml), that is, D- 
glucose(C00031), arbutin(C06186), salicin(C01451), oxaloaceta- 
te(C00036) and acetate (C00033). The reference network consists 
of glycolysis, gluconeogenesis, citrate cycle, pentose phosphate 
pathway and butanol metabolism of other species (koOOOlO.xml, 
ko00020.xml, ko00030.xml and ko00650.xml), where R01172 is 
treated as a reversible reaction. The target compound is butanol 
(C06142). The numbers of compound and reaction nodes are 93 
and 150, respectively, where 87 reactions are reversible. It took 
919.79 s (15m 19s) for the developed method to solve MRI and the 
solution was K a = {R00235, R00238, R01977, R03027, R01171, 
R01172, R03545}. These seven reactions form a path from 
acetate to 1 -butanol via acetyl-CoA, acetoacetyl-CoA, crotonoyl- 
CoA and butanoyl-CoA, which satisfies the Boolean constraints. 
Since the number of reactions in the reference network is 150, it is 
necessary to examine 150 C7 cases if an exhaustive search is 
conducted. Since examining 150C7 ~2.941 x 10" cases is almost 
impossible, it is seen that the IP-based method is useful for solving 
MRI, particularly when the given networks are not small. 

Difference between Developed Model and FBA-Based 
Model 

Finally, we conducted an experiment to show the difference 
between the developed model and the FBA-based model. We 
assume that the reference network consists of glycolysis, gluco- 
neogenesis, citrate cycle, pentose phosphate pathway and butanol 
metabolism of other species (koOOOlO.xml, ko00020.xml, 
ko00030.xml and ko00650.xml), and the host network includes 
only one reaction R04394 between salicin (C01451) and salicin 6- 
phosphate (C06188) as shown in Fig. 7. Therefore, the source 
node is only salicin (C01451). Note that reversible reactions are 
decomposed into two reactions, and denoted by P and Q, The 
target compound is maleic acid (CO 1384). The numbers of 
compound and reaction nodes are 93 and 150, respectively, where 
87 reactions are reversible. 

Then, the solution of MRI in our Boolean model is {R05134, 
R02736, R02035, R02036, R05605, R00344, R00342, R01082, 
R01087}, whereas the solution of FBA-based model is {R05134, 
R02736, R02035, R02036, R05605, R01058, R01518, R00658, 
R00200, R00344, R00342, R01082, R01087}. It is to be noted 
that {R01058, R01518, R00658, R00200} is not necessary for the 
Boolean model, but necessary for the FBA-based model. In the 



Boolean model, R01058 is not necessary to produce CO 1384 since 
the lack of reactions in downstream does not affect. However, in 
the FBA model, R01058 is necessary. Otherwise, COO 118 is not 
consumed and then R05605 (denoted as Q05605 in Fig. 7) cannot 
occur. Thus, the solution of MRI in the FBA-based model tends to 
include more reactions than that in the Boolean model. It took 
7896.46 s (2hllm36s) to solve the Boolean model of MRI. 

Theoretical Results 

Although solving IP is NP-complete, a problem that can be 
formalized as IP is not always NP-complete. Therefore, in the 
following paragraphs, we prove that MRI is NP-complete and 
show the appropriateness of formalizing MRI of the Boolean 
model as IP. 

Theorem 1: Minimum Reaction Insertion is NP- 
complete even when the maximum indegree and outdegree are 
bounded by 2. 

Proof: Since the problem is clearly in NP, it suffices to show NP- 
hardness. The proof is by a polynomial time reduction from 
minimum vertex cover (MVC), which is a problem for a given 
graph to find the minimum number of nodes so that each edge is 
incident to at least one of the selected nodes. For example, for the 
graph shown in Fig. 8 (A), {vi,V5,Vg} is an optimal solution of 
MVC. 

Let G = (V,E) be an instance of MVC, where V = {vi, . . . ,v„} 
and E = {e\, . . . ,e,„}. We construct the corresponding MRI as 
follows. The host network G\ =(V CI \J V ri ,E\) is given by 

V ci ={cu,... ,C„i}U{ci2, • • • ,c„ 2 }, 



Vri={r u ,...,r„i}, 



Ei={{c n ,ni}\i= 1, • • • ,n}\j{{rn,c i2 }\i= 1, ■ • • ,«}• 

The reference network Gj = (V Cl (J V n ,E2) is given by 
V C2 = K q U{cj, . . . ,c m }U{c,}, 

V r2 = V ri U{n,...,r n }U{r t }, 
E 2 = Ei U {{c i2 ,n} | i = 1 , • ■ • ,«} 



U{{r/,c/}|if V/ is an end point of e,} 



U{{c,-,f- ( }|i = l,...,»i} 



U{r„c t }. 

For example, MVC for the graph shown in Fig. 8 (A) is 
converted into MRI shown in Fig. 8 (B). It is clear that this 
conversion can be done in polynomial time. 
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In the following paragraphs, we show that MVC for G has a 
solution of size z if and only if MRI has a solution in which 
|Pa| =z +l holds. When G has a vertex cover of size z, 
V a = {ri\vi is included in the vertex cover}U{''(} satisfies c t = l 
in the minimal valid assignment and |K a |=z+l holds. On the 
other hand, suppose that V a satisfies c, = 1 in the minimal valid 
assignment and \ V a \ =z + 1. Since r,eV a is necessary for c t = 1, z 
nodes are included in V a from {r\, . . . ,r n }. Since each Cj must be 
1 to satisfy r, = l, at least one predecessor of each c ; must be 
included in V a for each j. Since there is an edge between r, and cj 
if and only if v, is incident to ej, {v^eF,,} is a vertex cover of size 
z. Nodes whose degrees are more than 2 can be converted by the 
methods shown in Fig. 9. 

Theorem 2: Given a host network, a reference network and a 
set of additional reactions, a minimal valid assignment is uniquely 
determined. 

Proof: For any valid assignment A, the assignment obtained by 
assigning 0 to all nodes that are not source connected is also a valid 
assignment. On the other hand, for any valid assignment A, the 
assignment obtained by assigning 0 to a source connected node is 
not a valid assignment. Since source connected nodes V sc are 
uniquely determined for V a , V sc is a minimal valid assignment and 
uniquely determined. 

Discussion 

In this paper, we formalized an optimization problem MRI in a 
Boolean model with a notion of minimal valid assignment. We 
proved that MRI in the Boolean model is NP-complete and the 
minimal valid assignment is uniquely determined when V a is 
given. Since an exhaustive search cannot be used to solve MRI 
when the given networks are not small, we developed an IP-based 
method for MRI. To improve the scalability of the developed 
method, it is necessary to reduce the number of variables 
appearing in IP formalization since the computational time of IP 
is considered to be exponential to the number of variables. 
Although the simple IP formalization with the notion of time is 
useful for solving MRI, it needs 0((m + n) 2 ) variables in IP 
formalization. If the notion of FVS is used, the number of 
necessary time steps reduces to /, where / denotes the size of FVS, 
and the number of variables in IP is 0(f (m + n+ /)). Although the 
idea of using FVS is similar to [32], many modifications are 

References 

1. Soh KC, Hatzimanikatis V (2010) Dreams of metabolism. Trends in 
Biotechnology 28(10): 501-508. 

2. Bro C, Regcnbcrg B, Forstcr J, Nielsen J (2006) In silieo aided metabolic 
engineering of saecharomyecs eerevisiae for improved bioethanol production. 
Metabolic Engineering 8(2): 102-111. 

3. Lcc SK, Chou H, Ham TS, Lee TS, Keasling JD (2008) Metabolic engineering 
of microorganisms for biofucls production: from bugs to synthetic biology to 
fuels. Current Opinion in Biotechnology 19(6): 556-563. 

4. Alper H, Jin YS, Moxley J, Stcphanopoulos G (2005) Identifying gene targets for 
the metabolic engineering of lycopene biosynthesis in cscherichia coli. Metabolic 
Engineering 7(3): 155-164. 

5. Prather K, Martin G (2008) De novo biosynthetic pathways: rational design of 
microbial chemical factories. Current Opinion in Biotechnology 19(5): 468-474. 

6. Nakamura GE, Whited GM (2003) Metabolic engineering for the microbial 
production of 1,3- propanediol. Current Opinion in Biotechnology 14(5): 454— 
459. 

7. Ro DK, Paradise EM, Ouellet M, Fisher KJ, Newman KL, ct al. (2006) 
Production of the antimalarial drug precursor artemisinic acid in engineered 
yeast. Nature 440(7086): 940-943. 

8. de Boer AL, Sehmidt-Danncrt C (2003) Recent efforts in engineering microbial 
cells to produce new chemical compounds. Current Opinion in Chemical 
Biology 7(2): 273-278. 

9. Mijts BN, Sehmidt-Danncrt C (2003) Engineering of secondary metabolite 
pathways. Current Opinion in Biotechnology 14(6): 597-602. 



necessary since the minimal valid assignment and the maximal 
valid assignment have many different properties. 

We also conducted four computer experiments in which data 
were downloaded from the KEGG database, CPLEX was used as 
the IP solver, and propanol, butanol, sedoheptulose 7-phosphate, 
and maleic acid were used as the target compound for each 
experiment. The host network was a metabolic network of E. coli 
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